Abstract

Typical dimensionality reduction and feature 
extraction methods focus on directly reduc- 
ing the number of random variables under 
consideration and retain maximal variations 
in the data. In this paper, we consider the 
dimensionality reduction in high-dimension 
parameter space and uncover a Confident In- 
formation First (CIF) principle to maximally 
preserve parameters with most confident es- 
timates and screen out less reliable or noisy 
parameters. Formally, the confidence of a 
parameter can be assessed by its Fisher in- 
formation, which can establish a connection 
with the inverse variance of any unbiased 
estimate for the considered parameter via 
the Cramér-Rao bound. The Restricted Boltzmann
Machine (RBM) is one example to which
CIF has been successfully applied. We
theoretically and empirically show that an RBM
with $\lceil n/2 \rceil$ hidden units and $n$ visible units
(2:1 RBM for short) could achieve a parsimonious
representation of data in the hidden
variable space and in the meantime comply
with CIF in parameter space. Several
layers of RBMs are then composed to form
a deep structure in order to achieve an efficient
compression ratio. CIF could help us
understand the empirical success of deep
neural networks.

1. Introduction 

Recently, (Hinton & Salakhutdinov, 2006) introduced
a greedy layer-wise unsupervised learning algorithm
for Deep Belief Networks (DBN), which obtained
impressive results in reducing the dimensionality of
data. The building block of a DBN is an exponential
probabilistic model called the Restricted Boltzmann
Machine (RBM). In (Bengio et al., 2006) and
(Salakhutdinov & Hinton, 2012), the RBM along with
the greedy training algorithm is found to be applicable
to models other than DBN. Due to the importance of
RBM in learning deep architectures, the present paper
studies RBM in terms of dimensionality reduction
using Information Geometry (IG) (Amari & Nagaoka,
1993).

Typical dimensionality reduction (or feature extraction)
methods (Fodor, 2002) (Lee & Verleysen, 2007)
focus on directly reducing the number of random
variables under consideration by transforming the
data from a high-dimensional visible variable space
to an intrinsic hidden space of lower dimensionality,
while retaining maximal variations in the data.
For example, Principal Component Analysis (PCA)
(Abdi & Williams, 2010) finds the directions of the
greatest variations in the data and represents the data
by the coordinates along those directions. We could
summarize these considerations as an approach that learns
features that are as disentangled as possible and discards
less significant variations of the data (Bengio et al.,
2012). Although this Disentangling Feature First principle
has been shown useful by many successful applications
(Fodor, 2002), features capturing more
variation in the data are not necessarily
more informative for the task, especially in cases where
the visible variables used to represent data are redundant
or noisy. More formally, in the unsupervised configuration,
the effectiveness of dimensionality reduction
on the input variable space largely depends on a reasonable
distance metric, while noisy features can disturb
the distance metric and result in unreasonable
feature spaces. The situation is even worse
when the sample data is insufficient or unrepresentative
(Duin & Pękalska, 2006) (Jiang & Guo, 2007). To
this end, a metric-invariant model may be preferred
in the unsupervised configuration (Hou et al., 2010).
It should be noted that the learning of common BM-based
models does not explicitly depend on the metric
of the input feature space. Indeed, (Desjardins et al.,
2013) present a metric-free natural gradient for
joint training of Boltzmann Machines, which suggests
that deep neural networks have some
metric-invariant nature.

From the probabilistic modeling perspective, the question
of dimensionality reduction (or feature extraction)
can be interpreted as an attempt to recover a parsimonious
set of latent variables that describe a distribution
over the observed high-dimensional data, with a
set of model parameters (Bengio et al., 2012). In general,
high-dimensional data often requires a high-dimensional
parameter space in order to sufficiently
depict the original data. However, overfitting would
generally occur when the model is excessively complex.
For example, (Dauphin & Bengio, 2013) empirically
shows the failure of some big neural networks
to leverage the added capacity to reduce underfitting.
Hence it is important to understand what the first
principle is for reducing the dimensionality of the parameter
space, given the intrinsic complexity of the
underlying model and a certain sample size.¹

In this paper, we propose the Confident Information
First (CIF) principle to maximally preserve parameters
with confident estimates and screen out noisy
(less reliable) parameters. Formally, the confidence²
of a parameter can be assessed by its Fisher information
(Amari & Nagaoka, 1993), which is connected
with the inverse variance of any unbiased
estimate of the considered parameter via the Cramér-Rao
bound (Rao, 1945). Compared to the traditional Disentangling
Feature First strategy, CIF gives us a
more principled and context-independent viewpoint for
dealing with high-dimensional data in parameter space
by a strategy that is irrelevant to the input metric,
where the context-independence is derived from the
metric-invariant nature of CIF.

The RBM is indeed one striking example to which CIF
has been successfully applied. The present paper
theoretically and empirically shows that an RBM
could achieve a parsimonious representation of data
in the hidden variable space with a sufficient number
of hidden units (approximately, $m \ge \lceil n/2 \rceil$, where $n$
is the number of visible units and $m$ is the number
of hidden units) and in the meantime comply with
CIF in parameter space. As empirically shown in
(Bengio et al., 2011), deep learners benefit more from
out-of-distribution samples, which is consistent with
our analysis in that CIF would give a reasonably
smoothed generative distribution.

¹ In model selection, information criteria (AIC, AICc,
BIC, etc.) give some insight into the relationship between
model complexity and sample size.

² In the present paper, the meaning of confidence is different
from the common concept of degree of confidence in statistics.

More specifically, using IG
theory (Amari & Nagaoka, 1993), we study all probability
distributions over visible variables on a general
statistical manifold $S$ in terms of the mixed-coordinate
system $[\zeta] = (\eta^1_i, \eta^2_{ij}, \theta_3^{i,j,k}, \ldots, \theta_n^{1,\ldots,n})$ (Amari et al.,
1992). We demonstrate that an RBM determines a
submanifold of $S$, which could greatly reduce the representation
dimensions while maximally preserving the
expected local distance in $S$ in terms of the unique invariant
distance determined by the Fisher information
metric. This is demonstrated in two steps:

• The single-layer Boltzmann Machine (SBM) implements
the CIF principle by retaining the 1st- and
2nd-order components (i.e., $\eta^1_i$, $\eta^2_{ij}$), while fixing
the less confident components $\theta_l^{i,j,\ldots}$ to zero
($l \ge 3$). We also prove that SBM maximally
preserves the expected local distance on $S$, i.e.,
achieves the best approximation;

• An RBM could completely implement an SBM (with $n$
visible units) only if we introduce a sufficient number
of hidden units (approximately, $m \ge \lceil n/2 \rceil$).

It turns out that the practical 2:1 RBM ($m \approx \lceil n/2 \rceil$)
has two nice properties: a parsimonious representation
in hidden variable space and a reliable representation
in parameter space. In order to achieve an efficient compression
ratio, several layers of 2:1 RBMs are composed
to form a deep structure (e.g.,
DBN), where hidden units are trained to capture high-order
dependencies among units in the lower layer. The
present paper tries to uncover the mystery that lies in
deep networks and to illuminate a more general feature
learning principle in terms of the intrinsic information
confidence in the estimation of representation
parameters.

2. Theoretical Foundations of IG 

In IG, a family of probability distributions is considered
as a differentiable manifold with certain coordinate
systems. In this section, we briefly introduce the
theoretical foundations of IG for the manifold $S$ of the
open simplex of all probability distributions over binary
random variables $x \in \{0, 1\}^n$.

2.1. Notations for Manifold S 

In IG, each probability distribution can be represented
as a point on the manifold with the parameters
of the model as coordinates. In the
case of binary random variables, four basic coordinate
systems are often used: p-coordinates,
η-coordinates, θ-coordinates and mixed-coordinates
(Amari & Nagaoka, 1993) (Hou et al., 2013). The mixed-coordinate
system is of vital importance for our analysis.

For the p-coordinates $[p]$ with $n$ binary variables,
the probability distribution over the $2^n$ states of $x$ can
be completely specified by any $2^n - 1$ positive numbers
indicating the probabilities of the corresponding
exclusive states on the $n$ binary variables. For example,
the p-coordinates of $n = 3$ variables could be
$[p] = (p_{001}, p_{010}, p_{011}, p_{100}, p_{101}, p_{110}, p_{111})$. Note that
IG requires all probability terms to be positive.

For simplicity, we use the capital letters $I, J, \ldots$ to
index the coordinate parameters of a probability distribution.
An index $I$ can be regarded as a subset of
$\{1, 2, \ldots, n\}$, and $p_I$ stands for the probability that all
variables indicated by $I$ equal one and the complementary
variables are zero. For example, if $I = \{1, 2, 4\}$
and $n = 4$, then $p_I = p_{1101} = \mathrm{Prob}(x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 1)$.
Note that the null set can also
be a legal index of the p-coordinates, which indicates
the probability that all variables are zero, denoted as
$p_{0 \ldots 0}$.

Another coordinate system often used in IG is the η-coordinates,
defined by:

$$\eta_I = E[X_I] = \mathrm{Prob}\Big\{\prod_{i \in I} x_i = 1\Big\} \qquad (1)$$

where $X_I = \prod_{i \in I} x_i$ and the expectation is taken with
respect to the probability distribution of $x$. Grouping
the coordinates by their orders, the η-coordinate system is
denoted as $[\eta] = (\eta^1_i, \eta^2_{ij}, \ldots, \eta^n_{1,2,\ldots,n})$, where the superscript
indicates the order number of the corresponding
parameter. For example, $\eta^2_{ij}$ denotes the set of all η
parameters with the order number 2.
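
As a small illustration of Eq. (1), the following Python sketch computes the η-coordinates of a three-variable distribution. The function name `eta_coordinates`, the dictionary-of-states representation, and the convention that a state tuple is ordered as (x_1, x_2, x_3) are assumptions made for this example. The probabilities are those of the three-variable example in Section 3, with p_000 = 0.05 obtained from normalization.

```python
from itertools import combinations

def eta_coordinates(p, n):
    """Return {I: eta_I} for every non-empty index set I (frozenset of 1-based indices).

    p maps each binary state tuple (x_1, ..., x_n) to its probability.
    """
    eta = {}
    for order in range(1, n + 1):
        for I in combinations(range(1, n + 1), order):
            # eta_I = Prob{x_i = 1 for all i in I}  (Eq. 1)
            eta[frozenset(I)] = sum(
                prob for x, prob in p.items() if all(x[i - 1] == 1 for i in I)
            )
    return eta

# Assumed example distribution over n = 3 variables, keyed by (x1, x2, x3).
p = {
    (0, 0, 0): 0.05, (0, 0, 1): 0.15, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.20, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}
eta = eta_coordinates(p, 3)

# Print the coordinates grouped by order, as in [eta] = (eta^1, eta^2, eta^3).
for I in sorted(eta, key=lambda s: (len(s), sorted(s))):
    print(sorted(I), round(eta[I], 4))
```

Indexing by frozensets mirrors the subset notation $I \subseteq \{1, \ldots, n\}$ used in the text, so the order of a parameter is simply the size of its index set.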

The θ-coordinates (natural coordinates) are defined by:

$$\log p(x) = \sum_{I \subseteq \{1,2,\ldots,n\},\, I \neq \emptyset} \theta^I X_I - \psi \qquad (2)$$

where $\psi = -\log \mathrm{Prob}\{x_i = 0, \forall i \in \{1, 2, \ldots, n\}\}$.
We denote the θ-coordinates as $[\theta] = (\theta_1^{i}, \theta_2^{ij}, \ldots, \theta_n^{1,\ldots,n})$,
where the subscript indicates the order number of the
corresponding parameter. For example, $\theta_2^{ij}$ denotes
the set of all θ parameters with the order number 2.
Note that the order indices are located at different positions
in the η-coordinates and θ-coordinates, which follows the
convention in (Amari et al., 1992).

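The relation between the coordinate systems $[\eta]$ and $[\theta]$ is
bijective (Amari et al., 1992). More formally, they are
connected by the Legendre transformation:

$$\theta^I = \frac{\partial \phi(\eta)}{\partial \eta_I}, \qquad \eta_I = \frac{\partial \psi(\theta)}{\partial \theta^I} \qquad (3)$$

where $\psi(\theta)$ and $\phi(\eta)$ meet the following identity:

$$\psi(\theta) + \phi(\eta) - \sum_I \theta^I \eta_I = 0 \qquad (4)$$

The function $\psi(\theta)$ is introduced in Eq. (2):

$$\psi(\theta) = \log\Big(\sum_x \exp\Big\{\sum_I \theta^I X_I(x)\Big\}\Big) \qquad (5)$$

and hence $\phi(\eta)$ is the negative of entropy:

$$\phi(\eta) = \sum_x p(x; \theta(\eta)) \log p(x; \theta(\eta)) \qquad (6)$$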
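
The identity in Eq. (4) can be checked numerically on a small example. The sketch below is illustrative only: it assumes a three-variable distribution keyed by (x_1, x_2, x_3), and it obtains the θ-coordinates by Möbius inversion of Eq. (2), $\theta^I = \sum_{K \subseteq I} (-1)^{|I - K|} \log p_K$, which is a standard consequence of the log-linear expansion rather than a formula stated in the text. The helper names `subsets` and `state_of` are hypothetical.

```python
import math
from itertools import combinations

n = 3
# Assumed example distribution, keyed by the state (x1, x2, x3).
p = {
    (0, 0, 0): 0.05, (0, 0, 1): 0.15, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.20, (1, 0, 1): 0.10, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def subsets(items):
    """All subsets of `items` (including the empty set) as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def state_of(K):
    """Binary state in which exactly the variables indexed by K equal one."""
    return tuple(1 if i in K else 0 for i in range(1, n + 1))

index_sets = [I for I in subsets(range(1, n + 1)) if I]

# eta_I = E[X_I]  (Eq. 1)
eta = {I: sum(pr for x, pr in p.items() if all(x[i - 1] for i in I))
       for I in index_sets}

# theta^I via Moebius inversion of Eq. (2)
theta = {I: sum((-1) ** len(I - K) * math.log(p[state_of(K)])
                for K in subsets(I))
         for I in index_sets}

psi = -math.log(p[state_of(frozenset())])          # psi = -log Prob{x = 0}
phi = sum(pr * math.log(pr) for pr in p.values())  # negative entropy, Eq. (6)

# Left-hand side of Eq. (4); it should vanish up to rounding error.
residual = psi + phi - sum(theta[I] * eta[I] for I in index_sets)
print(f"psi + phi - sum(theta * eta) = {residual:.2e}")
```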



Next we introduce the mixed-coordinate system, which
is important for our derivation of the CIF principle.
In general, the manifold $S$ of probability distributions
could be represented in mixed coordinates
(Amari et al., 1992):

$$[\zeta]_l = (\eta^{l-}, \theta_{l+})$$

where the first part consists of η-coordinates with order
less than or equal to $l$ (denoted by $[\eta^{l-}]$) and the second
part consists of θ-coordinates with order greater than $l$
(denoted by $[\theta_{l+}]$). Without loss of generality, we will
study CIF in the case of $l = 2$.

2.2. Fisher Information Matrix for Parametric 
Coordinates 

The Fisher information between two modeling parameters
is defined as the covariance of the score. The
Fisher information matrix can be proved to be the
only invariant metric on the manifold of probability
distributions, i.e., the only one that is invariant to reparameterization
(Rao, 1945). Intuitively, the Fisher information measures
the amount of information from the data that a
statistic carries about the unknown parameter (Kass,
1989). Inspired by the Cramér-Rao bound³ (Rao, 1945),
we could assess the confidence of a certain parameter by
using the lower bound of the variance of its estimate. The CIF
principle utilizes this hint to evaluate the confidence
of a parameter in parameter space.

For a general coordinate system $[\xi]$, the Fisher information
between two coordinates $\xi_i$ and $\xi_j$ is (Rao, 1945):

$$g_{ij}(\xi) = E\Big[\frac{\partial \log p(x; \xi)}{\partial \xi_i} \cdot \frac{\partial \log p(x; \xi)}{\partial \xi_j}\Big] \qquad (7)$$

The coordinate parameters $\xi_i$ and $\xi_j$ are called orthogonal
if and only if their Fisher information vanishes,
i.e., $g_{ij} = 0$, meaning that their influences on the log-likelihood
function are uncorrelated. A more technical
meaning of orthogonality is that the maximum likelihood
estimations (MLE) of orthogonal parameters can
be performed independently.



The Fisher information for $[\theta]$ can be rewritten as
$g_{IJ} = \frac{\partial^2 \psi(\theta)}{\partial \theta^I \partial \theta^J}$, and for $[\eta]$ it is $g^{IJ} = \frac{\partial^2 \phi(\eta)}{\partial \eta_I \partial \eta_J}$
(Amari & Nagaoka, 1993). Let $G_\theta = (g_{IJ})$ and $G_\eta = (g^{IJ})$
be the Fisher information matrices for $[\theta]$ and $[\eta]$,
respectively. It can be shown that $G_\theta$ and $G_\eta$ are mutually
inverse matrices, i.e., $\sum_J g^{IJ} g_{JK} = \delta^I_K$, where
$\delta^I_K = 1$ if $I = K$ and $0$ otherwise (Amari & Nagaoka,
1993).

³ For a probabilistic model $p(x; \xi)$ with multiple parameters
$\xi = (\xi_1, \ldots, \xi_d)$, if $T(x) = (T_1(x), \ldots, T_d(x))$
is an unbiased estimator of $\xi$, then the Cramér-Rao bound
states that the covariance matrix $\mathrm{Cov}(T(x)) \ge G_\xi^{-1}$, where
$G_\xi$ is the Fisher information matrix for $\xi$ and "$\ge$" means
that $\mathrm{Cov}(T(x)) - G_\xi^{-1}$ is positive semidefinite.

The next two propositions (Propositions 2.1 and 2.2) give
general formulas for $G_\theta$ and $G_\eta$. Note that Proposition 2.1
is an extension of Theorem 2 in (Amari et al., 1992).

Proposition 2.1 The Fisher information between
two parameters in $[\theta]$, namely $\theta^I$ and $\theta^J$, is given by

$$g_{IJ}(\theta) = \eta_{I \cup J} - \eta_I \eta_J \qquad (8)$$

Proof in Appendix 7.1. □

Proposition 2.2 The Fisher information between
two parameters in $[\eta]$, namely $\eta_I$ and $\eta_J$, is given by

$$g^{IJ}(\eta) = \sum_{K \subseteq I \cap J} (-1)^{|I - K| + |J - K|} \cdot \frac{1}{p_K} \qquad (9)$$

where $|\cdot|$ denotes the cardinality operator.
Proof in Appendix 7.2. □

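For the probability distribution of three variables
with the p-coordinates $(p_{001}, p_{010}, p_{011}, p_{100}, p_{101}, p_{110}, p_{111})$,
we could calculate the Fisher information of
$\eta_I$ and $\eta_J$ based on Eq. (9). For $I = \{1, 2\}$ and
$J = \{2, 3\}$, $g^{IJ} = \frac{1}{p_{000}} + \frac{1}{p_{010}}$. For $I = \{1, 2\}$ and
$J = \{1, 2, 3\}$, $g^{IJ} = -\big(\frac{1}{p_{000}} + \frac{1}{p_{010}} + \frac{1}{p_{100}} + \frac{1}{p_{110}}\big)$.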
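
The two cases above can be reproduced with a short Python sketch that evaluates Eq. (9) directly and also checks that the resulting matrix is the inverse of the covariance form $\eta_{I \cup J} - \eta_I \eta_J$ of the Fisher information in θ-coordinates. The numerical distribution, the (x_1, x_2, x_3) ordering of state tuples, and the helper names are assumptions made for this illustration; exact rational arithmetic is used so that the inverse relation can be tested without rounding error.

```python
from fractions import Fraction
from itertools import combinations

n = 3
# Assumed probabilities in percent, keyed by the state (x1, x2, x3).
percent = {
    (0, 0, 0): 5, (0, 0, 1): 15, (0, 1, 0): 10, (0, 1, 1): 5,
    (1, 0, 0): 20, (1, 0, 1): 10, (1, 1, 0): 5, (1, 1, 1): 30,
}
p = {x: Fraction(v, 100) for x, v in percent.items()}

def subsets(items):
    """All subsets of `items` (including the empty set) as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def state_of(K):
    """Binary state in which exactly the variables indexed by K equal one."""
    return tuple(1 if i in K else 0 for i in range(1, n + 1))

index_sets = [I for I in subsets(range(1, n + 1)) if I]
eta = {I: sum(pr for x, pr in p.items() if all(x[i - 1] for i in I))
       for I in index_sets}

def g_theta(I, J):
    """Fisher information between theta^I and theta^J (covariance of X_I, X_J)."""
    return eta[I | J] - eta[I] * eta[J]

def g_eta(I, J):
    """Fisher information between eta_I and eta_J, following Eq. (9)."""
    return sum(Fraction((-1) ** (len(I - K) + len(J - K))) / p[state_of(K)]
               for K in subsets(I & J))

# The two worked cases from the text.
case_a = g_eta(frozenset({1, 2}), frozenset({2, 3}))
case_b = g_eta(frozenset({1, 2}), frozenset({1, 2, 3}))
print(case_a, case_b)

# G_theta and G_eta should be mutually inverse matrices.
is_inverse = all(
    sum(g_theta(I, J) * g_eta(J, K) for J in index_sets) == (1 if I == K else 0)
    for I in index_sets for K in index_sets
)
print(is_inverse)
```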

3. CIF in Mixed-coordinates

In this section, we will formally present the CIF principle
in terms of reducing the dimensionality of the parameter
space on the manifold $S$. Let the mixed-coordinates
be $[\zeta] = (\eta^1_i, \eta^2_{ij}, \theta_3^{i,j,k}, \ldots, \theta_n^{1,\ldots,n})$, where
the first part consists of η-coordinates with order less than
or equal to 2 (denoted by $[\eta^{2-}]$) and the second part
consists of θ-coordinates with order greater than 2 (denoted
by $[\theta_{2+}]$). Proposition 3.2 gives a closed form for
calculating the Fisher information matrix $G_\zeta$.

First, we introduce a lemma indicating the relationship
between the Fisher information matrices of two different
coordinate systems with common parameters.
Consider two coordinate systems of the same probability
distribution: $[\xi]$ and $[\zeta]$, with Fisher information
matrices $G_\xi$ and $G_\zeta$ respectively. Assume $[\xi]$ and $[\zeta]$
share a common set of parameters $\delta = ([\xi] \cap [\zeta])$, and
the indices of $\delta$ in $[\xi]$ and $[\zeta]$ are $I_\xi$ and $I_\zeta$ respectively.

Lemma 3.1 $(G_\xi^{-1})_{I_\xi} = (G_\zeta^{-1})_{I_\zeta}$, where $(\cdot)_I$ denotes
the sub-matrix determined by the index $I$.

Proof According to the Cramér-Rao bound (Rao, 1945),
a parameter (or a pair of parameters) has a unique
asymptotically tight lower bound of the variance (or
covariance) of its unbiased estimate, which is given by the
corresponding element of the inverse of the Fisher information
matrix involving this parameter (or this pair
of parameters). Then the lemma follows directly. □

Based on Lemma 3.1, the Fisher information matrix
for the mixed-coordinates $[\zeta]$ is given as follows:

Proposition 3.2 The Fisher information matrix of
the mixed-coordinates $[\zeta]$ is given by:

$$G_\zeta = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} \qquad (10)$$

where $A = ((G_\eta^{-1})_{I_\eta})^{-1}$, $B = ((G_\theta^{-1})_{J_\theta})^{-1}$, $G_\eta$ and
$G_\theta$ are the Fisher information matrices of the corresponding
coordinates, $I_\eta$ is the index set of the
parameters shared by $[\eta]$ and $[\zeta]$, i.e., $\{\eta^1_i, \eta^2_{ij}\}$, and $J_\theta$
is the index set of the parameters shared by $[\theta]$ and $[\zeta]$,
i.e., $\{\theta_3^{i,j,k}, \ldots, \theta_n^{1,\ldots,n}\}$.

Proof in Appendix 7.3. □

Proposition 3.3 The diagonal elements of $A$ are
lower bounded by 1, and those of $B$ are upper bounded
by 1.

Proof in Appendix 7.4. □

From $G_\zeta$, it is easy to check that the confidences
of the coordinate parameters entail a natural hierarchy:
the first-part parameters $[\eta^{2-}]$ with higher confidences
are separated from the second-part parameters $[\theta_{2+}]$
with lower confidences. Let us consider an example
of a three-variable distribution whose p-coordinates
and mixed-coordinates are $[p] = (p_{001} = 0.15, p_{010} = 0.1, p_{011} = 0.05, p_{100} = 0.2, p_{101} = 0.1, p_{110} = 0.05, p_{111} = 0.3)$
and $[\zeta] = (\eta_1, \eta_2, \eta_3, \eta_{12}, \eta_{13}, \eta_{23}, \theta^{123})$
respectively. Then the corresponding confidence for
each parameter $\zeta_i$ is given by the diagonal element
$g(\zeta_i)$ of $G_\zeta$: $g(\eta_1) = 18.18$, $g(\eta_2) = 20.53$, $g(\eta_3) = 18.42$,
$g(\eta_{12}) = 22.37$, $g(\eta_{13}) = 23.39$, $g(\eta_{23}) = 23.16$,
$g(\theta^{123}) = 0.01$. We can see that the confidences
of the parameters in $[\eta^{2-}]$ are much larger than that
of $[\theta_{2+}]$. Moreover, the Fisher information between
two parameters in $[\eta^{2-}]$ and $[\theta_{2+}]$ vanishes, indicating
that we could estimate these two parts separately.
Hence we can implement the CIF principle of dimensionality
reduction on the parameter space $[\zeta]$ by replacing
the parameters of low confidence with the fixed neutral
value zero and reconstructing the resulting distribution.
It turns out that the submanifold tailored by CIF becomes
$[\zeta'] = (\eta^1_i, \eta^2_{ij}, 0, \ldots, 0)$. It is interesting that
SBM exactly implements the CIF principle, which is
explained in detail in the next section.
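
The block structure of Proposition 3.2 can be explored numerically for the three-variable example. The following Python sketch is only an illustration under stated assumptions: p_000 = 0.05 is taken from normalization, the p-coordinate subscripts are read as (x_1, x_2, x_3), and the helper names (`subsets`, `inverse`, `block`) are hypothetical. Because the mapping between subscript positions and variable indices is an assumption, individual diagonal entries may appear permuted relative to the values quoted in the text. The sketch uses the covariance form of the Fisher information in θ-coordinates, $g_{IJ} = \eta_{I \cup J} - \eta_I \eta_J$, together with the fact that $G_\eta = G_\theta^{-1}$.

```python
from fractions import Fraction
from itertools import combinations

n = 3
# Assumed probabilities in percent, keyed by the state (x1, x2, x3).
percent = {
    (0, 0, 0): 5, (0, 0, 1): 15, (0, 1, 0): 10, (0, 1, 1): 5,
    (1, 0, 0): 20, (1, 0, 1): 10, (1, 1, 0): 5, (1, 1, 1): 30,
}
p = {x: Fraction(v, 100) for x, v in percent.items()}

def subsets(items):
    """All subsets of `items` (including the empty set) as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def inverse(M):
    """Gauss-Jordan inverse of a square, invertible matrix of Fractions."""
    size = len(M)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(size)]
           for i, row in enumerate(M)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

def block(M, idx):
    """Sub-matrix of M restricted to the rows and columns in idx."""
    return [[M[r][c] for c in idx] for r in idx]

index_sets = [I for I in subsets(range(1, n + 1)) if I]
eta = {I: sum(pr for x, pr in p.items() if all(x[i - 1] for i in I))
       for I in index_sets}

G_theta = [[eta[I | J] - eta[I] * eta[J] for J in index_sets] for I in index_sets]
G_eta = inverse(G_theta)

low = [k for k, I in enumerate(index_sets) if len(I) <= 2]   # eta part of [zeta]
high = [k for k, I in enumerate(index_sets) if len(I) > 2]   # theta part of [zeta]

# Proposition 3.2: A = ((G_eta^{-1})_{I_eta})^{-1}, B = ((G_theta^{-1})_{J_theta})^{-1}
A = inverse(block(G_theta, low))
B = inverse(block(G_eta, high))

confidence = {index_sets[k]: A[pos][pos] for pos, k in enumerate(low)}
confidence.update({index_sets[k]: B[pos][pos] for pos, k in enumerate(high)})

for I, g in confidence.items():
    print(sorted(I), round(float(g), 2))
```

The printed diagonal shows the hierarchy discussed above: every entry of the η part is well above one, while the single third-order θ entry is far below one.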

4. The Geometry of BM

4.1. Notations in BM 

A general Boltzmann Machine (BM) (Ackley et al.,
1985) is defined as a stochastic neural network consisting
of visible units $x \in \{0, 1\}^{n_x}$ and hidden units
$h \in \{0, 1\}^{n_h}$, where each unit fires stochastically depending
on the weighted sum of its inputs. The energy
function is defined as follows:

$$E_{BM}(x, h; \xi) = -\frac{1}{2} x^T U x - \frac{1}{2} h^T V h - x^T W h - b^T x - d^T h \qquad (11)$$

where $\xi = \{U, V, W, b, d\}$ are the parameters: visible-visible
interactions ($U$), hidden-hidden interactions
($V$), visible-hidden interactions ($W$), visible self-connections
($b$) and hidden self-connections ($d$). The
diagonals of $U$ and $V$ are set to zero. We can express
the Boltzmann distribution over the joint space of the observed
variables $x$ and the latent variables $h$ as below:

$$p(x, h; \xi) = \frac{1}{Z} \exp\{-E_{BM}(x, h; \xi)\} \qquad (12)$$

where $Z$ is a normalization factor.

Next we will focus on two special BMs: 1) the single-layer
BM without hidden units (SBM); 2) the restricted
BM (RBM), which restricts the interactions in Eq.
(11) to those between $h$ and $x$. The energy functions
for SBM and RBM are respectively given by:

$$E_{SBM}(x) = -\frac{1}{2} x^T U x - b^T x \qquad (13)$$
$$E_{RBM}(x, h) = -x^T W h - b^T x - d^T h \qquad (14)$$

In fact, the parameterizations used by BM (i.e.,
$\{U, V, W, b, d\}$) could be naturally reformulated in the
coordinate systems of IG (e.g., $[\theta]$). For example, for
an SBM of $n$ visible units with parameters $\{U, b\}$, we
have the θ-coordinates as follows: $[\theta] = (\theta_1^{i} = b_i, \theta_2^{ij} = U_{ij}, \theta_3^{ijk} = 0, \ldots, \theta_n^{1,2,\ldots,n} = 0)$.
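
To make the correspondence between SBM parameters and θ-coordinates concrete, the following Python sketch builds a three-unit SBM from Eq. (13), enumerates its Boltzmann distribution, and recovers the θ-coordinates from the probabilities. The numerical values of `U` and `b` are arbitrary illustrative choices, and the recovery step uses Möbius inversion of Eq. (2), which is an assumption of this sketch rather than a procedure described in the text.

```python
import math
from itertools import combinations, product

n = 3
# Hypothetical SBM parameters: biases b and a symmetric U with zero diagonal.
b = [0.2, -0.4, 0.1]
U = [[0.0, 0.5, -0.3],
     [0.5, 0.0, 0.8],
     [-0.3, 0.8, 0.0]]

def energy(x):
    """E_SBM(x) = -1/2 x^T U x - b^T x  (Eq. 13)."""
    quad = sum(U[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return -0.5 * quad - sum(b[i] * x[i] for i in range(n))

# Boltzmann distribution over all 2^n visible states.
states = list(product([0, 1], repeat=n))
Z = sum(math.exp(-energy(x)) for x in states)
p = {x: math.exp(-energy(x)) / Z for x in states}

def subsets(items):
    """All subsets of `items` (including the empty set) as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def state_of(K):
    """Binary state in which exactly the variables indexed by K equal one."""
    return tuple(1 if i in K else 0 for i in range(1, n + 1))

# theta^I = sum over K subset of I of (-1)^{|I - K|} log p_K
theta = {I: sum((-1) ** len(I - K) * math.log(p[state_of(K)])
                for K in subsets(I))
         for I in subsets(range(1, n + 1)) if I}

for I in sorted(theta, key=lambda s: (len(s), sorted(s))):
    print(sorted(I), round(theta[I], 6))
```

Up to rounding error, the first-order coordinates coincide with the biases, the second-order coordinates with the off-diagonal entries of `U`, and the third-order coordinate vanishes.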

4.2. CIF and Projections on BM

Given an underlying probability distribution $q(x)$ on
the general manifold $S$, we consider the problem of realizing
it by a BM (with stationary distribution $p(x)$)
as faithfully as possible, i.e., minimizing the distance
between $q(x)$ and $p(x)$ under the Fisher information
metric. Let BM be a smooth submanifold of $S$; then
the projection of an arbitrary point $Q \in S$ on BM is
given by the point $P \in$ BM that is closest to $Q$. In
this section, we mainly investigate the projections on
two typical BMs, i.e., SBM and RBM.

4.2.1. SBM Projection 

For an arbitrary point $Q$ on $S$ with mixed-coordinates
$(\eta^1_i, \eta^2_{ij}, \theta_3^{i,j,k}, \ldots, \theta_n^{1,\ldots,n})$, (Amari et al., 1992) proves
that the SBM projection $P$ of $Q$ on the SBM manifold
is exactly $[\zeta]_{SBM}(P) = (\eta^1_i, \eta^2_{ij}, 0, \ldots, 0)$. On the other
hand, to the best of our knowledge, the CIF principle underlying
the SBM projection has not been formally
addressed in the literature. Hence CIF can be considered
as a theoretical justification for SBM. Moreover, we
will show that the SBM projection maximally preserves
the expected local information distance among all submanifolds
with $\frac{n(n+1)}{2}$ dimensions, which uncovers a
geometric interpretation for applying CIF in SBM.

Proposition 4.1 Among all tailored $\frac{n(n+1)}{2}$-dimensional
submanifolds of $S$ in mixed coordinates $[\zeta]$ (obtained by setting
a certain set of coordinates to zero), the SBM projection
maximally preserves the expected local information distance
in terms of the Fisher information metric.

Proof Let $B_P$ be an $\epsilon$-ball surface centered at $P$ on the
manifold $S$, i.e., $B_P = \{\zeta \in S : \|\zeta - \zeta_P\|_2 = \epsilon\}$, where
$\|\cdot\|_2$ denotes the Euclidean norm and $\zeta_P$ is the coordinates
of $P$. Let $P + dP$ be a neighbor of $P$ uniformly
sampled on $B_P$ and $\zeta_{P+dP}$ be its corresponding coordinates.
For a small $\epsilon$, we can calculate the expected
information distance between $P$ and $P + dP$ as follows:

$$E_{B_P} = \int \big[(\zeta_{P+dP} - \zeta_P)^T G_\zeta (\zeta_{P+dP} - \zeta_P)\big]^{\frac{1}{2}} \, dB_P \qquad (15)$$

where $G_\zeta$ is the Fisher information matrix at $P$.

Since the Fisher information matrix $G_\zeta$ is both positive
definite and symmetric, there exists a singular value
decomposition (SVD) $G_\zeta = U^T \Lambda U$, where $U$ is an orthogonal
matrix and $\Lambda$ is a diagonal matrix with diagonal
entries equal to the eigenvalues of $G_\zeta$ (all $> 0$).

Applying the SVD to Eq. (15), the distance becomes:

$$E_{B_P} = \int \big[(\zeta_{P+dP} - \zeta_P)^T U^T \Lambda U (\zeta_{P+dP} - \zeta_P)\big]^{\frac{1}{2}} \, dB_P \qquad (16)$$

Note that $U$ is an orthogonal matrix, so the transformation
$U(\zeta_{P+dP} - \zeta_P)$ is a norm-preserving rotation.

Now we need to show that among all tailored $\frac{n(n+1)}{2}$-dimensional
submanifolds of $S$, SBM is the one that
preserves the maximum information distance. Assume
$I_T = \{i_1, i_2, \ldots, i_{n(n+1)/2}\}$ is the index set of the $\frac{n(n+1)}{2}$ coordinates
that we choose to form the tailored submanifold
$T$ in the mixed-coordinates $[\zeta]$. Based on Eq. (16),
the expected information distance $E_{B_P}$ for $T$ is proportional
to the sum of eigenvalues of the sub-matrix
$(G_\zeta)_{I_T}$, where the sum equals the trace of $(G_\zeta)_{I_T}$.

Next we show that the sub-matrix of $G_\zeta$ specified
by SBM gives the maximum trace. Based on Proposition
3.3, the elements on the main diagonal of the
sub-matrix $A$ are lower bounded by one, and those of
$B$ are upper bounded by one. Therefore, SBM gives
the maximum trace among all sub-matrices of $G_\zeta$. This
completes the proof. □

In summary, SBM confines the manifold of probability
distributions to the parameter subspace spanned by
those directions with the greatest Fisher information and
maximally preserves the information distance. Thus
we can see that SBM is indeed an exact implementation
of the CIF principle proposed in Section 1.

The next section will show that an RBM can simulate an SBM
with a certain number of hidden variables and that it has the
power of reducing dimensions in both variable space
and parameter space.

4.2.2. RBM Projection 

Let $RBM_{n,m}$ be an RBM with $n$ visible units and $m$
hidden units that produces a stationary distribution
$p(x, h; \xi_{RBM_{n,m}})$, where $\xi_{RBM_{n,m}}$ denotes the coordinate
parameters of $RBM_{n,m}$. Similarly, $SBM_n$ denotes an
SBM with $n$ visible units. Let $S_{xh}$ be the general manifold
of probability distributions over the joint space of
visible units $x$ and hidden units $h$, and $S_x$ be the general
manifold over visible units $x$. It is easy to see that
$RBM_{n,m}$ defines a submanifold of $S_{xh}$.

Given an observation distribution $q(x) \in S_x$, our
goal is to find the $p(x, h; \xi_{RBM_{n,m}}) \in S_{xh}$ with the
marginal distribution $p(x; \xi_{RBM_{n,m}})$ that best approximates
$q(x)$. Let $E_q = \{q(x, h) \in S_{xh} \mid \sum_h q(x, h) = q(x)\}$
be the submanifold of $S_{xh}$ that has the same
marginal distribution $q(x)$. It can be proved that the
minimum Kullback-Leibler (K-L) divergence between
$q(x)$ and $p(x; \xi_{RBM_{n,m}})$ is equal to the minimum divergence
between distributions in $E_q$ and $RBM_{n,m}$
(Amari et al., 1992).

As shown in Section 4.2.1, $SBM_n$ exactly implements
the CIF principle and achieves a faithful approximation
on $S_x$. Therefore it is natural to investigate the relation
between $RBM_{n,m}$ and $SBM_n$. To be specific, we
want to know whether it is possible for an $RBM_{n,m}$ to
implement a certain $SBM_n$ and confine the marginal
probability $p(x; \xi)$ within the manifold $SBM_n$.

Intuitively, the number of hidden units is vital
for RBM in terms of modeling capacity.
(Roux & Bengio, 2008) proved that adding hidden
units yields strictly improved modeling power of
RBM, and that RBM is a universal approximator of discrete
distributions. In this section, we study the lower
bound of the number of hidden units in $RBM_{n,m}$ needed to
completely approximate the probability distributions
in $SBM_n$.

Proposition 4.2 $RBM_{n,m}$ could exactly implement
any $SBM_n$ only if $m \ge \left\lceil \frac{n(n-1)}{2(n+1)} \right\rceil$.

Proof Let $\xi_{RBM_{n,m}}$ and $\xi_{SBM_n}$ be the parameters
of $RBM_{n,m}$ and $SBM_n$ respectively, and let their corresponding
sets of probability distributions over the visible
variables be $\{p(x; \xi_{RBM_{n,m}})\}$ and $\{p(x; \xi_{SBM_n})\}$. Based on
Eq. (2), $\{p(x; \xi_{SBM_n})\}$ has the dimensionality $\frac{n(n+1)}{2}$ in terms
of the number of free parameters. On the other hand,
as shown by (Cueto et al., 2010), $\{p(x; \xi_{RBM_{n,m}})\}$ has
the dimensionality:

$$\min\{nm + n + m,\ 2^n - 1\}$$

for $m \le 2^{n - \lceil \log_2(n+1) \rceil}$ or $m \ge 2^{n - \lfloor \log_2(n+1) \rfloor}$.⁴ This
result is valid in most cases since we usually
have $m$ much smaller than $2^{n - \lceil \log_2(n+1) \rceil}$. It implies
that $RBM_{n,m}$ is identifiable, i.e., the parametrization
of the model is locally one-to-one with respect
to $\{p(x; \xi_{RBM_{n,m}})\}$. Therefore, $RBM_{n,m}$ can exactly
implement any $SBM_n$ only if $RBM_{n,m}$ has at least
the same dimensionality as $SBM_n$, which requires
$\frac{n(n+1)}{2} \le m + n + mn$. □
On the theoretical side, Proposition 4.2 gives a necessary
condition for $SBM_n$ to be implemented by
$RBM_{n,m}$. It is interesting that practical RBMs and
deep learning structures often use an approximate 2:1
RBM ($m \approx \lceil n/2 \rceil$) as their building blocks. This observation
matches the analysis in Proposition 4.2 and
further implies that $m \approx \lceil n/2 \rceil$ may be a sufficient condition
for $RBM_{n,m}$ to implement $SBM_n$. Intuitively,
$RBM_{n,m}$ can implement $SBM_n$ when the number of
parameters of the former is solidly larger than that of
the latter. However, a rigorous proof requires showing
the strict inclusion relation between two semi-algebraic sets
in an exact problem setting, or between two algebraic varieties
in tropical geometry (Mikhalkin, 2006), which poses
some complexities of analysis. We hence leave the formal
justification as future work and investigate the
sufficiency of the 2:1 RBM empirically in Section 5.

On the practical side, the settings of RBM reflect a trade-off
between the following two contrary requirements. First,
a parsimonious representation requires the number of
hidden units of the RBM to be as small as possible (requires
$m < n$). Second, as shown in Proposition 4.2, an RBM
could faithfully simulate an SBM only if $m \ge \left\lceil \frac{n(n-1)}{2(n+1)} \right\rceil$.
It turns out that the 2:1 RBM ($m \approx \lceil n/2 \rceil$) becomes
a reasonable setting since the scale of hidden units is
both parsimonious for representation and necessary for
modeling SBM. This issue will be evaluated in Section
5, which empirically shows that the RBM achieves the
most stable performance roughly at $m \approx \lceil n/2 \rceil$.

⁴ Note that Proposition 4.2 cannot be proved by counting
the number of model parameters, since it is well known
that there exist bijections or continuous maps of $\mathbb{R}$ onto
$\mathbb{R}^n$, as pointed out by Cantor in 1877 and by Peano in
1890.

4.2.3. Discussions on deep architecture

Several layers of RBMs are composed to form
a deep structure in order to achieve an efficient compression
ratio, where hidden units are trained to capture
high-order dependencies among units in the lower layers.
We propose that the deep architecture determines a
submanifold $M$ of $S$, where $M$ could greatly reduce the
representation dimensions while preserving the most
confident information on parameters. Hence the CIF
principle, which deals with dimensionality reduction
in parameter space, gives us a rational explanation
of the motivation of deep representations
in terms of the tradeoff between intrinsic information
confidence and the compactness of representation,
w.r.t. both parameters and variables.

5. Experimental Results 

In this section, we first validate Proposition 4.2 on
simulated data sampled from an SBM stationary distribution.
Then, we further illustrate how the number
of hidden units in an RBM affects its approximation of
a general probability distribution (not necessarily an
SBM stationary distribution) compared to SBM.

Experimental Setup: Here we adopt the distribu- 
tions over 10 visible variables with several high-order 
dependencies embedded (by raising their correspond- 
ing marginal probabilities) as the underlying distri- 
bution. Then training samples of any size can be 
generated. For the first case, those samples are fur- 
ther used to train SBM, whose stationary distribu- 
tion serves as the target for training RBM. For the 
second case, those samples are directly used in train- 
ing SBM and RBM. The Contrastive Divergence algo- 
rithm (Salakhutdinov & Hinton, 2012) is adopted for 
training BMs. 

Evaluation Metric: K-L divergence is used to eval- 
uate the goodness-of-fit of the stationary distribution 
of BM we trained w.r.t the underlying distribution 
(Roux & Bengio, 2008). Usually a smaller K-L diver- 
gence means a better fit. In this experiment, each case 
study runs multiple times (100 times) and we report 
the averaged K-L divergences. 
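
For reference, the K-L divergence between two discrete distributions can be computed as in the following Python sketch. The direction of the divergence (underlying distribution first, model second), the dictionary representation of distributions, and the function name `kl_divergence` are assumptions of this example, since the text does not spell them out.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) = sum_x q(x) * log(q(x) / p(x)), with q the reference distribution.

    q and p map each state to its probability; p must be positive wherever q is.
    """
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0)

# Toy distributions over a single binary variable (illustrative values only).
q = {0: 0.5, 1: 0.5}
p_model = {0: 0.25, 1: 0.75}

print(kl_divergence(q, p_model))  # positive: the model differs from q
print(kl_divergence(q, q))        # zero: identical distributions
```

In an experiment with repeated runs, this quantity would be computed once per trained model and then averaged.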

5.1. Simulating SBM by RBM 

Proposition 4.2 gives us a necessary condition on the
hidden layer size $m$ of an RBM in order to
completely implement all possible SBMs (with the
same number of visible units as in the RBM, say $n$). We
conjecture that this could also be a sufficient condition,
meaning that an RBM with $m \approx \lceil n/2 \rceil$ provides
sufficient degrees of freedom to simulate an SBM. In Fig.
1(a), we can see that the K-L divergence between the
trained RBM and the target SBM stationary distribution
decreases rapidly as $m$ varies from 2 to 4, and then
achieves a steady performance at $m \ge 5$ with a little
fluctuation (even slightly worse performance with
a larger $m$). This could be explained as follows: our
target distribution is indeed an SBM stationary distribution,
which has $n(n+1)/2$ parameters, and we could
achieve a robust model for this distribution with approximately
the same dimension of parameter space.
This result supports our conjecture that an RBM with
$m \approx \lceil n/2 \rceil$ hidden variables would be enough for modeling
a distribution specified by an SBM with $n$ units in
the case of relatively sufficient sampling.

5.2. Influence of RBM Hidden Layer Size 

In this section, we illustrate how the hidden layer
size m of an RBM influences the performance of
density estimation. Here we assume a complex
distribution with several significant high-order
patterns. Both RBM and SBM are then used to
approximate the underlying distribution in two
situations: large and small samples. The K-L
divergences between the RBMs (with different m) and
the underlying distribution are shown in Fig. 1(b)
and 1(c).

In the large-sample case in Fig. 1(b), the K-L
divergence for RBM declines dramatically as m goes
from 1 to 10, and then tends to increase steadily as
m increases further. In the small-sample case in
Fig. 1(c), the K-L divergence for RBM drops sharply
as m rises from 1 to 4 and bottoms out at m = 5.
Comparing the two cases, the threshold of m beyond
which the RBM overfits is much lower for the small
sample than for the large one. These observations can
be explained as follows. First, as the model grows
excessively complex (with increasing m), the RBM
gradually models more of the sampling bias (i.e.,
noise) in the data instead of the intrinsic character
of the underlying distribution, which results in
overfitting. Second, a smaller sample prefers a
simpler model. More specifically, the training
process of an RBM is assumed to search for the
parameter settings that best describe the true
distribution. We have shown before that SBM realizes
the CIF principle and maximally preserves the most
confident information, which means that our most
confident guess about the true distribution based on
the given data lies in the set of probability
distributions bounded by the SBM manifold. As m
rises, the set of probability distributions bounded
by RBM gradually grows and eventually submerges the
set bounded by SBM, i.e., the volume ratio between
the set bounded by SBM and the set bounded by RBM
becomes zero. In practice, there are often thousands
of visible variables and the sampling is often highly
insufficient. An excessively strong RBM then carries
a high risk of being trained into a distribution
beyond the SBM manifold, resulting in less reliable
approximations to the true distribution. We hence
conclude that the most faithful choice for m is
approximately the bound given in Proposition 4.2.

Figure 1. (a) shows the approximation performance of RBM w.r.t. an SBM stationary distribution with 600 samples; (b)
and (c) show the performance of RBM and SBM w.r.t. a general distribution with sampling size 100 and 10 respectively.
Note that (a) does not compare RBM with SBM since the target distribution is a stationary distribution of SBM.

6. Conclusions 

For dimensionality reduction, we expect to map an
input manifold of complicated shape into spaces where
distributions are much simpler, so that it becomes
much easier to model the joint distribution between
high-level abstractions, in the sense of requiring
much less data to learn (Bengio et al., 2012). The
CIF principle proposed in this paper is motivated by
this expectation: it preserves the parameters with
the most confident estimates and discards less
reliable parameters during dimensionality reduction
in parameter space. The RBM is particularly
interesting for this purpose: it achieves a
parsimonious representation in hidden variable space
and a reliable representation in parameter space
(using the metric-invariant CIF principle). This
answers the question posed at the beginning. CIF
gives us a principled and context-independent way to
deal with complex input data. In future work, we will
further develop the formal justification of CIF and
its applications in deep neural networks.

7. Appendix 

7.1. Proof of Proposition 2.1 
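Proof By definition, we have:

$$ g_{IJ} = \frac{\partial^2 \psi(\theta)}{\partial \theta^I \, \partial \theta^J} \tag{18} $$

where $\psi(\theta)$ is defined by Eq. (4). Hence, we have:

$$ g_{IJ} = \frac{\partial^2 \left( \sum_K \theta^K \eta_K - \phi(\eta) \right)}{\partial \theta^I \, \partial \theta^J} = \frac{\partial \eta_I}{\partial \theta^J} \tag{19} $$

By differentiating $\eta_I$ with respect to $\theta^J$, we have

$$ \frac{\partial \eta_I}{\partial \theta^J} = \frac{\partial \sum_x X_I(x) \exp\{ \sum_K \theta^K X_K(x) - \psi(\theta) \}}{\partial \theta^J} = \sum_x X_I(x) \left[ X_J(x) - \eta_J \right] p(x; \theta) = \eta_{I \cup J} - \eta_I \eta_J \tag{20} $$

From Eqs. (19) and (20), we complete the proof. □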
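
The identity in Eq. (20) can be checked numerically on a small example. The sketch below assumes the log-linear form used in the proof, with $X_I(x) = \prod_{i \in I} x_i$ over nonempty index sets I and $\psi(\theta)$ the log-partition function; it compares $\eta_{I \cup J} - \eta_I \eta_J$ with a central finite-difference estimate of the second derivative of $\psi$. All function names and the step size `h` are illustrative choices.

```python
import math
from itertools import combinations, product

def nonempty_subsets(n):
    return [frozenset(c) for r in range(1, n + 1)
            for c in combinations(range(n), r)]

def feature(I, x):
    """X_I(x) = prod_{i in I} x_i for a binary vector x."""
    return 1 if all(x[i] == 1 for i in I) else 0

def log_partition(theta, n):
    """psi(theta) = log sum_x exp(sum_K theta^K X_K(x))."""
    return math.log(sum(
        math.exp(sum(t * feature(K, x) for K, t in theta.items()))
        for x in product((0, 1), repeat=n)))

def eta(theta, n, I):
    """eta_I = E[X_I(x)] under p(x; theta)."""
    psi = log_partition(theta, n)
    return sum(
        feature(I, x)
        * math.exp(sum(t * feature(K, x) for K, t in theta.items()) - psi)
        for x in product((0, 1), repeat=n))

def fisher_closed_form(theta, n, I, J):
    """Right-hand side of Eq. (20): eta_{I u J} - eta_I * eta_J."""
    return eta(theta, n, I | J) - eta(theta, n, I) * eta(theta, n, J)

def fisher_finite_difference(theta, n, I, J, h=1e-4):
    """Central-difference estimate of d^2 psi / (d theta^I d theta^J)."""
    def psi_shifted(di, dj):
        shifted = dict(theta)
        shifted[I] += di
        shifted[J] += dj
        return log_partition(shifted, n)
    return (psi_shifted(h, h) - psi_shifted(h, -h)
            - psi_shifted(-h, h) + psi_shifted(-h, -h)) / (4 * h * h)
```

For random $\theta$ and small n, the two functions should agree up to the finite-difference error for every pair (I, J).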

7.2. Proof of Proposition 2.2 
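Proof By definition, we have:

$$ g^{IJ} = \frac{\partial^2 \phi(\eta)}{\partial \eta_I \, \partial \eta_J} \tag{21} $$

where $\phi(\eta)$ is defined by Eq. (4). Hence we have

$$ g^{IJ} = \frac{\partial^2 \left( \sum_K \theta^K \eta_K - \psi(\theta) \right)}{\partial \eta_I \, \partial \eta_J} = \frac{\partial \theta^I}{\partial \eta_J} \tag{22} $$

Based on Eq. (2), $\theta^I$ could be calculated by solving
a linear equation system, as follows:

$$ \theta^I = \sum_{K \subseteq I} (-1)^{|I - K|} \log(p_K) \tag{23} $$

Based on Eq. (1), $p_K$ could be calculated by a linear
combination of $\eta_J$ coordinates as follows:

$$ p_K = \sum_{K \subseteq J} (-1)^{|J - K|} \eta_J \tag{24} $$

Therefore, the partial derivative of $\theta^I$ w.r.t. $\eta_J$ is:

$$ \frac{\partial \theta^I}{\partial \eta_J} = \sum_K \frac{\partial \theta^I}{\partial p_K} \cdot \frac{\partial p_K}{\partial \eta_J} = \sum_{K \subseteq I \cap J} (-1)^{|I - K| + |J - K|} \cdot \frac{1}{p_K} \tag{25} $$

Based on Eqs. (22) and (25), we complete the proof. □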
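
Since the Fisher information matrices in the $\theta$- and $\eta$-coordinates are inverses of each other, the closed form in Eq. (25) can be sanity-checked by multiplying it with the matrix $\eta_{I \cup J} - \eta_I \eta_J$ of Proposition 2.1. The sketch below assumes that $p_K$ denotes the probability of the state in which exactly the variables in K are active, and that K ranges over all subsets of $I \cap J$, including the empty set; the function names and the random test distribution are hypothetical.

```python
import random
from itertools import combinations

def subsets(items):
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def random_positive_distribution(n, rng):
    """Strictly positive p on {0,1}^n, keyed by the set K of active variables."""
    weights = {K: rng.uniform(0.5, 1.5) for K in subsets(range(n))}
    total = sum(weights.values())
    return {K: w / total for K, w in weights.items()}

def eta(p, I):
    """eta_I = P(x_i = 1 for all i in I) = sum of p_K over all K containing I."""
    return sum(pk for K, pk in p.items() if I <= K)

def fisher_theta(p, I, J):
    """Proposition 2.1: g_IJ = eta_{I u J} - eta_I * eta_J."""
    return eta(p, I | J) - eta(p, I) * eta(p, J)

def fisher_eta(p, I, J):
    """Eq. (25): sum over K in I n J of (-1)^(|I-K| + |J-K|) / p_K."""
    return sum((-1) ** (len(I - K) + len(J - K)) / p[K]
               for K in subsets(I & J))

def product_of_fisher_matrices(n, seed=0):
    """Matrix product G_theta * G_eta, indexed by the nonempty subsets."""
    p = random_positive_distribution(n, random.Random(seed))
    index = [I for I in subsets(range(n)) if I]
    return [[sum(fisher_theta(p, I, K) * fisher_eta(p, K, J) for K in index)
             for J in index] for I in index]
```

If the reconstruction is right, `product_of_fisher_matrices(n)` is numerically the identity matrix for small n.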
7.3. Proof of Proposition 3.2 

Proof The Fisher information matrix of $[\zeta]$ could be
partitioned into four parts:

$$ G_\zeta = \begin{pmatrix} A & C \\ D & B \end{pmatrix}. $$

It can be verified that in the mixed coordinates $[\zeta]$,
the $\theta$-coordinate of order k is orthogonal to any
$\eta$-coordinate of order less than k, implying that the
corresponding element of the Fisher information matrix
is zero (C = D = 0) (Nakahara & Amari, 2002). Hence,
$G_\zeta$ is a block diagonal matrix. Based on Lemma (3.1),
we have $(G_\zeta^{-1})_{I_\eta} = (G_\eta^{-1})_{I_\eta}$ and
$(G_\zeta^{-1})_{J_\theta} = (G_\theta^{-1})_{J_\theta}$, i.e.,

$$ G_\zeta^{-1} = \begin{pmatrix} (G_\eta^{-1})_{I_\eta} & 0 \\ 0 & (G_\theta^{-1})_{J_\theta} \end{pmatrix}. $$

Since $G_\zeta$ is a block diagonal matrix, the proposition
follows. □

7.4. Proof of Proposition 3.3 

Proof Let the Fisher information matrix of $[\theta]$ be

$$ G_\theta = \begin{pmatrix} U & X \\ X^T & V \end{pmatrix}, $$

which is partitioned based on $I_\eta$ and $J_\theta$. Based on
Proposition 3.2, we have $A = U^{-1}$. Obviously, the
diagonal elements of U are all smaller than one.
According to the succeeding Lemma 7.1, the diagonal
elements of A (i.e., $U^{-1}$) are greater than 1.

Next we need to show that the diagonal elements of B
are smaller than 1. Using the Schur complement of
$G_\theta$, the bottom-right block of $G_\theta^{-1}$, i.e.,
$(G_\theta^{-1})_{J_\theta}$, equals $(V - X^T U^{-1} X)^{-1}$. Thus the
diagonal elements of B satisfy
$B_{jj} = (V - X^T U^{-1} X)_{jj} < V_{jj} < 1$. Hence we
complete the proof. □

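Lemma 7.1 Let H be an $l \times l$ positive definite matrix.
If $H_{ii} < 1$, then $(H^{-1})_{ii} > 1$, $\forall i \in \{1, 2, \ldots, l\}$.

Proof Since H is positive definite, it is a Gramian
matrix of l linearly independent vectors
$v_1, v_2, \ldots, v_l$, i.e., $H_{ij} = \langle v_i, v_j \rangle$
($\langle \cdot, \cdot \rangle$ denotes the inner product).
Similarly, $H^{-1}$ is the Gramian matrix of l linearly
independent vectors $w_1, w_2, \ldots, w_l$ (the dual basis
of the $v_i$) and $(H^{-1})_{ij} = \langle w_i, w_j \rangle$. It is
easy to verify that $\langle w_i, v_i \rangle = 1$,
$\forall i \in \{1, 2, \ldots, l\}$. If $H_{ii} < 1$, we can see
that the norm $\|v_i\| = \sqrt{H_{ii}} < 1$. Since
$\|w_i\| \times \|v_i\| \ge \langle w_i, v_i \rangle = 1$ by the
Cauchy-Schwarz inequality, we have $\|w_i\| > 1$. Hence,
$(H^{-1})_{ii} = \langle w_i, w_i \rangle = \|w_i\|^2 > 1$. □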
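
Lemma 7.1 can be illustrated numerically. The sketch below builds H as the Gramian matrix of random vectors with norm below 1, inverts it with a small Gauss-Jordan routine, and forms the dual vectors $w_i = \sum_j (H^{-1})_{ij} v_j$ used in the proof. The helper names, the random construction, and the choice of plain Gauss-Jordan elimination are illustrative assumptions.

```python
import random

def gram(vectors):
    """Gramian matrix H_ij = <v_i, v_j>."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors]
            for u in vectors]

def inverse(matrix):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    size = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

def random_short_vectors(l, dim, rng):
    """l random vectors in R^dim, each rescaled to a norm in (0.2, 0.9)."""
    vectors = []
    for _ in range(l):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = sum(x * x for x in v) ** 0.5
        target = rng.uniform(0.2, 0.9)
        vectors.append([x * target / norm for x in v])
    return vectors

def dual_vectors(vectors, h_inv):
    """w_i = sum_j (H^{-1})_ij v_j, so that <w_i, v_k> = delta_ik."""
    dim = len(vectors[0])
    return [[sum(h_inv[i][j] * vectors[j][d] for j in range(len(vectors)))
             for d in range(dim)] for i in range(len(vectors))]
```

For such an H, every diagonal entry is below 1 by construction, and the diagonal of `inverse(H)` should come out above 1, as the lemma states.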